Stroke is one of the leading causes of death and disability worldwide and remains a major public health challenge[1]. Early identification of high-risk individuals is crucial for prevention and timely intervention. We therefore develop and fit a logistic regression model on key health indicators to evaluate how effective this much simpler method can be.
> Occam’s Razor: The simplest solution is always the best.
The Binary Logistic Model

The logistic regression model uses the logit link function to model the probability of the outcome, \(\pi = P[Y = 1]\):
\[\ln\left(\frac{\pi}{1-\pi}\right) = \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{k}x_{k}\]
The defining characteristic of binary logistic regression is the type of dependent (or outcome) variable.[2] The dependent variable in a binary logistic regression has exactly two levels.
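To make the link function concrete, the sketch below inverts the logit to recover \(\pi\) from the linear predictor. This is a minimal Python illustration; the coefficient values are hypothetical and are not the fitted model from this project.

```python
import numpy as np

def predict_prob(beta0, beta, x):
    """P[Y=1] from the logit link: log(pi / (1 - pi)) = beta0 + beta . x."""
    eta = beta0 + np.dot(beta, x)      # linear predictor
    return 1.0 / (1.0 + np.exp(-eta))  # inverse logit (sigmoid)

# Hypothetical coefficients for illustration only: intercept, age, hypertension.
p = predict_prob(-4.0, np.array([0.05, 0.8]), np.array([60.0, 1.0]))  # p ≈ 0.45
```

Note that a unit increase in any \(x_{j}\) shifts the log-odds by \(\beta_{j}\), which is what makes the fitted coefficients directly interpretable.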
The Stroke Prediction Dataset[3] originally contained 5,110 observations and 12 features. After removing missing and inconsistent entries, among other necessary changes, the cleaned dataset contained 3,357 observations and 11 predictors commonly associated with cerebrovascular risk. The key predictors are listed below.
| Feature Name | Description | Data Type | Values |
|---|---|---|---|
| gender | Patient’s gender | Numeric | 1 (Male), 0 (Female) |
| age | Patient’s age in years | Numeric | Range 0.08 to 82; rounded to 2 decimal places |
| hypertension | Indicates if the patient has hypertension | Numeric | 0 (No), 1 (Yes) |
| heart_disease | Indicates if the patient has any heart diseases | Numeric | 0 (No), 1 (Yes) |
| ever_married | Whether the patient has ever been married | Numeric | 1 (Yes), 0 (No) |
| work_type | Type of occupation | Numeric | 1 (Govt_job), 2 (Private), 3 (Self-employed), 4 (Never_worked) |
| Residence_type | Patient’s area of residence | Numeric | 1 (Urban), 2 (Rural) |
| avg_glucose_level | Average glucose level in blood | Numeric | Range ≈55.12 to 271.74 |
| bmi | Body Mass Index | Numeric | Range ≈10.3 to 97.6; converted from character, rounded to 2 decimals |
| smoking_status | Patient’s smoking status | Numeric | 1 (never smoked), 2 (formerly smoked), 3 (smokes) |
| stroke | Target Variable: Whether the patient had stroke | Numeric | 0 (No Stroke), 1 (Stroke) |
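The cleaning steps described above (numeric recoding of categorical features, converting `bmi` from character, dropping missing entries) can be sketched as follows. This is a Python/pandas analogue of the preprocessing, with a tiny hypothetical sample standing in for the real data; the column names and codes follow the feature table.

```python
import pandas as pd

# Two hypothetical raw rows; codes follow the feature table above.
raw = pd.DataFrame({
    "gender": ["Male", "Female"],
    "ever_married": ["Yes", "No"],
    "Residence_type": ["Urban", "Rural"],
    "bmi": ["23.456", "N/A"],  # bmi arrives as character, with missing entries
})

df = raw.copy()
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
df["ever_married"] = df["ever_married"].map({"Yes": 1, "No": 0})
df["Residence_type"] = df["Residence_type"].map({"Urban": 1, "Rural": 2})
df["bmi"] = pd.to_numeric(df["bmi"], errors="coerce").round(2)  # character -> numeric
df = df.dropna()  # drop rows with missing or inconsistent entries
```

`errors="coerce"` turns unparseable entries such as `"N/A"` into `NaN`, so they are removed by the subsequent `dropna()`.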
Histogram of (a) gender, (b) age, (c) hypertension, (d) heart_disease.
Histogram of (e) ever_married, (f) work_type, (g) Residence_type, (h) avg_glucose_level.
Histogram of (i) bmi, (j) smoking_status.
There is a massive increase in the \(\chi^2\) values, which shows that the oversampling technique substantially strengthened the statistical contribution of these predictors in the model.
| Factor | LR χ² (Original, anova2) | LR χ² (Balanced, anova3) | Change in χ² |
|---|---|---|---|
| age | 120.407 | 1201.85 | ≈10.0× Increase |
| hypertension | 18.205 | 154.34 | ≈8.5× Increase |
| avg_glucose_level | 11.337 | 69.73 | ≈6.1× Increase |
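The balanced model was produced with ROSE in R, which generates synthetic samples. As a simpler stand-in, the sketch below shows plain random oversampling in Python: minority-class rows are duplicated until the two classes have equal counts. The data here are hypothetical.

```python
import numpy as np

def oversample_minority(X, y, rng=None):
    """Randomly duplicate minority-class (y == 1) rows until classes are balanced.
    A simple stand-in for ROSE, which instead generates synthetic samples."""
    rng = rng or np.random.default_rng(0)
    idx_min = np.flatnonzero(y == 1)
    idx_maj = np.flatnonzero(y == 0)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

X = np.arange(10).reshape(10, 1)
y = np.array([0] * 8 + [1] * 2)  # 8:2 imbalance, like no-stroke vs. stroke
Xb, yb = oversample_minority(X, y)  # yb now has 8 of each class
```

Because oversampling inflates the effective sample size, likelihood-ratio \(\chi^2\) statistics computed on the balanced data are expected to grow, as the table above shows.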
The imbalanced models achieved high accuracy (\(\approx 94.5\%\)) and specificity (\(\approx 0.998\)) but are practically useless for stroke prediction: their near-zero sensitivity means they miss almost all actual stroke cases.
Model 3, which used oversampling to address the severe class imbalance in the stroke outcome, demonstrated a substantial improvement in predictive capability: sensitivity rose dramatically to \(0.6481\), correctly identifying 35 of the 54 actual stroke cases.
| Metric | Model 1 (Full, Imbalanced) | Model 2 (Reduced, Imbalanced) | Model 3 (Reduced, Balanced) |
|---|---|---|---|
| Accuracy | 0.94538 | 0.9444 | 0.7269 |
| Sensitivity (Recall) | 0.01852 | 0.0000 | 0.6481 |
| Specificity | 0.99790 | 0.9979 | 0.7314 |
| True Positives (TP) | 1 | 0 | 35 |
| False Negatives (FN) | 53 | 54 | 19 |
| True Negatives (TN) | 951 | 951 | 697 |
| False Positives (FP) | 2 | 2 | 256 |
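The metrics in the table follow directly from the confusion-matrix counts. As a quick consistency check (in Python here, although the project's analysis was done in R), plugging in Model 3's counts reproduces the reported values:

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall), and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # fraction of actual strokes caught
    specificity = tn / (tn + fp)  # fraction of non-strokes correctly cleared
    return accuracy, sensitivity, specificity

acc, sens, spec = metrics(tp=35, fn=19, tn=697, fp=256)  # Model 3 counts
# acc ≈ 0.7269, sens ≈ 0.6481, spec ≈ 0.7314
```

Note the trade-off this makes explicit: Model 3 buys its 35 true positives at the cost of 256 false positives, which is why accuracy drops even as sensitivity improves.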
Logistic regression proved to be a very simple and interpretable baseline model for stroke risk prediction. During the project we were able to evaluate its weaknesses when analysing a heavily imbalanced dataset such as the Stroke Prediction Dataset. Furthermore, a major weakness of logistic regression models is that they cannot establish causal relationships.[7]
We found that addressing the class imbalance via oversampling (ROSE) was critical for obtaining a model that predicts stroke outcomes with some success. However, this still falls well short of the precision and accuracy of ensemble methods such as the Dense Stacking Ensemble (DSE) model applied in [8], which used a meta-classifier to combine the strengths of simpler models with higher-performing complex models.